Star Wars for Dummies

Author

Brock

Show the code
import pandas as pd 
import numpy as np
from lets_plot import *
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import *

LetsPlot.setup_html(isolated_frame=True)

Introduction

Can Star Wars-related survey responses be used to predict whether a person earns more than $50,000 annually??

Data Loading & Missing Value Analysis

Show the code
# First 2 rows are column headers, created a map to make it easy to read
new_col_names = ["id", "seen_any", "fan_starwars", 
"seen_Ep_I", "seen_Ep_II", "seen_Ep_III", "seen_Ep_IV", "seen_Ep_V", "seen_Ep_VI", 
"rank_Ep_I", "rank_Ep_II", "rank_Ep_III", "rank_Ep_IV", "rank_Ep_V", "rank_Ep_VI", 
"fav_han", "fav_luke", "fav_leia", "fav_anakin", "fav_obi", "fav_palpatine", "fav_darth", "fav_lando", "fav_boba", "fav_c3po", "fav_r2", "fav_jar", "fav_padme", "fav_yoda", 
"who_shot_first", "familiar_expanded_universe", "fan_expanded_universe", "fan_startrek", "gender", "age", "income", "educ", "region"] 

df = pd.read_csv("https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv", encoding_errors="ignore", names = new_col_names, skiprows = 2) 

display(df.head())
id seen_any fan_starwars seen_Ep_I seen_Ep_II seen_Ep_III seen_Ep_IV seen_Ep_V seen_Ep_VI rank_Ep_I ... fav_yoda who_shot_first familiar_expanded_universe fan_expanded_universe fan_startrek gender age income educ region
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1.0 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns

The original Star Wars survey dataset from FiveThirtyEight had two header rows and inconsistent column names. To make the data more usable, I recreated the column names to be more user-friendly.

Example: instead of long survey text like “Which of the following characters is your favorite? - Han Solo”, I renamed it to simply fav_han.

I also skipped the first two rows to avoid duplicated headers and formatting artifacts.

Now the dataset loads cleanly into a DataFrame with 38 columns covering Star Wars movie viewership, rankings, favorite characters, and demographic information (gender, age, income, education, region).

From here, I’ll inspect the structure, data types, and missing values to prompt the direction of this project.

Show the code
datatypes = df.dtypes
# Percent missing values
missing = (df.isna().sum() / len(df) * 100).round(1)
display(f"Top 10 Missing Values Percentages of Columns")
display(missing.sort_values(ascending=False).head(10))  # Show top 10
'Top 10 Missing Values Percentages of Columns'
fan_expanded_universe    82.0
seen_Ep_III              53.6
seen_Ep_II               51.9
seen_Ep_IV               48.8
seen_Ep_I                43.3
seen_Ep_VI               37.8
seen_Ep_V                36.1
fav_boba                 31.5
fav_palpatine            31.4
fav_padme                31.4
dtype: float64

Missing Value Analysis

The columns with the highest percentage of missing values reveal clear patterns in survey behavior:

Expanded Universe fandom (fan_expanded_universe) is missing in 82% of responses. This suggests most participants were either unfamiliar with or chose not to answer questions about the broader Star Wars canon.

Movie viewing history (seen_Ep_I through seen_Ep_VI) shows between 36–54% missingness. This likely reflects that many respondents did not watch certain films and skipped those questions.

Favorite character fields such as fav_boba, fav_palpatine, and fav_padme each have around 31% missingness, consistent with respondents who had not seen the films and therefore did not select a favorite.

These missing values are not random (NMAR) but tied to the survey’s logic. Respondents who had not seen the movies were more likely to leave rankings and character preferences blank.

Because the missing values reflect survey behavior rather than data entry errors, they will need to be handled thoughtfully (e.g., recoding to “Not seen” or “No favorite”). However, before diving into a full cleaning process, it is important to first explore the structure of the variables themselves.

The next step will be to examine the nature of the target variable and key predictors in order to determine which model(s) will be best suited for this project.

Exploratory Data Analysis

Show the code
# Grab value counts and Get proportion of column it represents
income_counts = df['income'].value_counts(dropna=False)
income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)

# Display the counts and proportions as a table
display(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count Proportion
income
NaN 328 0.28
$50,000 - $99,999 298 0.25
$25,000 - $49,999 186 0.16
$100,000 - $149,999 141 0.12
$0 - $24,999 138 0.12
$150,000+ 95 0.08

The raw income variable contains six categories plus a substantial amount of missing data. This distribution shows two important points:

The variable, income, is categorical, not continuous — respondents selected income brackets, not exact dollar values. The distribution is fairly even, with the 50k-100k group the most common and 0-25k least common.

Therefore, I will encode the variables as 0, for <=50k, and 1, for >50k

This binning aligns with the project goal: predicting whether someone earns more than $50k annually. That decision makes logistic regression the appropriate baseline model, rather than linear regression.

Show the code
# Convert Income ranges to values 0 for less than 50k
income_map = {
    '$0 - $24,999': 0,
    '$25,000 - $49,999': 0,
    '$50,000 - $99,999': 1,
    '$100,000 - $149,999': 1,
    '$150,000+': 1
}
# Replace
df.loc[:,'income'] = df.loc[:,'income'].replace(income_map)

# Grab value counts and Get proportion of column it represents
income_counts = df['income'].value_counts(dropna=False)
income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)

# Display the counts and proportions as a table
display(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count Proportion
income
1.0 534 0.45
NaN 328 0.28
0.0 324 0.27

The new binning, makes the income variable fairly balanced, weighting towards over 50k having more representation as to <=50k.

I will now move on converting string-ordinal columns, to numeric-ordinal column for the purpose of visualizing, finding correlations, and preparing for modeling.

Show the code
# Convert age to numeric, picking the median of each, 66 for >60 because 72 is average life expectancy
age_map = {'18-29': 23, '30-44': 37, '45-60': 52.5, '> 60': 66}
# Replace
df['age'] = df['age'].map(age_map)

# Convert fav_ columns to number rank
fav_map = {
  'Very favorably': 2,
  'Somewhat favorably': 1,
  'Neither favorably nor unfavorably (neutral)': 0,
  'Unfamiliar (N/A)': 0,
  'Somewhat unfavorably': -1,
  'Very unfavorably': -2
}
# Replace
df.iloc[:, list(range(15,29))] = df.iloc[:, list(range(15,29))].replace(fav_map)
Show the code
# Select favorability columns by name pattern and change dtype for correlation later
fav_cols = [col for col in df.columns if 'fav_' in col]
df[fav_cols] = df[fav_cols].astype('float64')
rank_cols = [col for col in df.columns if 'rank_' in col]
df[rank_cols] = df[rank_cols].astype('Int64')
Show the code
# Select all numeric columns
numeric_cols = [col for col in df.columns if df[col].dtype in ['int64', 'int8', 'Int64', 'Int8', 'float64']]

# Calculate correlations with income
correlations = df[numeric_cols + ['income']].corrwith(df['income'])
correlations = correlations.drop('income').sort_values(key=abs, ascending=False)

print("Top 5 correlations with income (>$50k):")
display((correlations.head(5)))
print("\nBottom 5 correlations:")
display(correlations.tail(5))

df['income'] = df['income'].astype('bool')
Top 5 correlations with income (>$50k):
fav_luke    0.148572
fav_leia    0.147244
age         0.128260
fav_obi     0.121114
fav_han     0.119479
dtype: float64

Bottom 5 correlations:
fav_jar         -0.044423
fav_palpatine   -0.020139
fav_boba         0.018170
fav_darth       -0.010936
id              -0.008996
dtype: float64

From the aforementioned correlations, we can see certain characters, and movies that have a small correlation around 10% with income.

Age, had somewhat to do as most would have guessed, with a 12% correlation with income.

The main standout from this, is no one variable is worth exploring as being a predictor of income by itself, but rather an interaction of multiple variables.

In the coming graphs, I am looking for different distributions, which will tell me there is a meaningful relationship between 2 variables on income.

Show the code
(
    ggplot(df, aes(x="fan_startrek", fill="fan_starwars")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="Star Wars vs Star Trek Fandom by Income",
        x="Star Trek Fan", y="Count", fill="Star Wars Fan"
    )
)
Show the code
(
    ggplot(df, aes('rank_Ep_II')) +
    geom_bar() +  
    facet_wrap(facets='income') +
    labs(
        title="Proportion of Star Trek Fans Across Income Levels",
        x="Age", y="Count"
    )
)
Show the code
top_regions = df['region'].value_counts().nlargest(6).index
df2 = df
df2['region_plot'] = np.where(df2['region'].isin(top_regions), df2['region'], 'Other')

(
    ggplot(df2, aes(x="region_plot")) +
    geom_bar() +
    coord_flip() +
    facet_grid(y='income', x='gender') +
    labs(title="Region Distribution by Income and Gender", x="Region", y="Income")
)
Show the code
display(ggplot(df, aes("seen_any", fill='fan_startrek')) +\
  geom_bar() +\
  facet_wrap("income"))
Show the code
display(ggplot(df, aes("region")) +\
  geom_bar() +\
  coord_cartesian(ylim=(0,50)) +\
  facet_wrap(facets=["income", "age"]))
Show the code
(
    ggplot(df.dropna(subset=['age']), aes(x="income", y="age")) +
    geom_boxplot() +
    coord_cartesian(ylim=(15, 100)) +
    labs(title="Age by Income Bracket", x="Income", y="Age")
)
Show the code
(
    ggplot(df, aes(x="educ", fill="income")) +
    geom_bar(position="fill") +
    coord_flip() +
    labs(
        title="Education and Income (Proportions)",
        x="Education Level", y="Proportion", fill="Income"
    )
)

As I explored multi layered relationships, a few interesting things occured:

The interaction between gender, and education show differing distributions, perhaps uncovering a good predictor of income.

The interactions between age and education also added to a differing distribution.

All in all, the distributions are more similar than different across almost all interactions explored.

Therefore, I will entertain multiple tree models, including Random Forest, K-Nearest Neighbors, Gradient Boosting, Decision Tree, and NB booster. This project is limited by my knowledge, which only includes these ML algorithms as of today.

Model Prep

Show the code
# factorize into 1s 0s education column
df['educ'] = df['educ'].astype('category')
df['educ'] = pd.factorize(df['educ'])[0]
df['educ'] = df['educ'].astype('category')

# Drop ID column
df1 = df.drop(df.columns[0], axis=1)

# Separate columns to exclude from one-hot encoding
encode_mask = df1.columns.str.contains(r'fav_|income|educ|age', case=False)
new_encode = df1.loc[:, ~encode_mask].astype("category")
old_encode = df1.loc[:, encode_mask]

# One-hot encode the remaining categorical/object columns
sw_encoded = pd.get_dummies(new_encode, dummy_na=True, prefix_sep='_', drop_first=False, dtype=int)
sw_encoded.columns = sw_encoded.columns.str.replace(' ', '_')

# Concatenate back the excluded columns
sw = pd.concat([df.iloc[:,0], sw_encoded, old_encode], axis=1)

# Drop rows with 7 or more nan
sw = sw[sw.isna().sum(axis=1) <= 7]

# Drop rows with nan in outcome variable
sw = sw.dropna(subset=['income'])

# drop ID column
sw = sw.drop('id', axis=1)

To begin, I transformed age into numeric midpoints and randomly assigned values between 60 and 80 for respondents who reported > 60. The income variable was recoded into a binary classification target, where:

\texttt{income} = \begin{cases} 0 & \text{if income } < \$50{,}000 \\\\ 1 & \text{if income } \geq \$50{,}000 \end{cases}

Favorability ratings for characters (columns prefixed with fav_) were mapped to a numerical scale from -2 (very unfavorable) to +2 (very favorable). I excluded high-cardinality columns from one-hot encoding (fav_, income, educ, and age), then encoded the remaining categorical variables using pd.get_dummies() with dummy_na=True. I then dropped rows with 7 or more missing values or any missing values in the target variable.

Show the code
display(f"{len(df1)} Observations")
display(f"Percent of Observations Missing")
display((df1.isna().sum() / len(df1) * 100).sort_values(ascending=False))
'1186 Observations'
'Percent of Observations Missing'
fan_expanded_universe         82.040472
seen_Ep_III                   53.625632
seen_Ep_II                    51.854975
seen_Ep_IV                    48.819562
seen_Ep_I                     43.254637
seen_Ep_VI                    37.774030
seen_Ep_V                     36.087690
fav_boba                      31.534570
fav_palpatine                 31.365936
fav_padme                     31.365936
fav_lando                     30.860034
fav_jar                       30.775717
fav_anakin                    30.607083
fav_obi                       30.438449
fav_darth                     30.354132
fav_yoda                      30.354132
fav_c3po                      30.269815
familiar_expanded_universe    30.185497
who_shot_first                30.185497
fav_han                       30.101180
fav_r2                        30.016863
fav_leia                      29.932546
fav_luke                      29.932546
rank_Ep_III                   29.595278
rank_Ep_I                     29.595278
fan_starwars                  29.510961
rank_Ep_II                    29.510961
rank_Ep_IV                    29.510961
rank_Ep_V                     29.510961
rank_Ep_VI                    29.510961
region                        12.057336
age                           11.804384
gender                        11.804384
fan_startrek                   9.949410
seen_any                       0.000000
income                         0.000000
educ                           0.000000
region_plot                    0.000000
dtype: float64

Machine Learning Model Introduction

After preprocessing, I assigned the predictors to x and the binary target variable to y = income. The dataset was split into training and test sets using a 75/25 ratio. Missing values in x were either filled with 0 (for models that require complete input like GradientBoostingClassifier and GaussianNB) or left as-is if the model could tolerate it.

I trained and evaluated the following classifiers:

  • RandomForestClassifier
  • GradientBoostingClassifier
  • DecisionTreeClassifier
  • GaussianNB (Naive Bayes)

Each model’s performance was evaluated using accuracy_score and classification_report from sklearn.metrics.

Show the code
# Assign Predictors and outcome variables
x = sw.drop(columns='income')
y = sw['income'].astype(int)
Show the code
# Split dataset into training and testing sets

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=50)

# Split and fill na's for gradient/Gassian
x_train_na = x_train.fillna(0)
x_test_na = x_test.fillna(0)
Show the code
# Train Random Forest Classifier
classifier_DT = RandomForestClassifier(
    random_state=70,
    max_depth=9,
    class_weight='balanced'

)
classifier_DT.fit(x_train, y_train)

# Train Gradient Boosting Classifier
classifier_DT2 = GradientBoostingClassifier(
  random_state=70,
  learning_rate=.05,
  max_depth=3

)
classifier_DT2.fit(x_train_na, y_train)

# Train Decision Tree Classifier
classifier_DT3 = DecisionTreeClassifier(
    random_state=70,
    max_depth=3
)
classifier_DT3.fit(x_train, y_train)

# Train Gaussian NB Classifier
classifier_DT4 = GaussianNB()
classifier_DT4.fit(x_train_na, y_train)

# KN
classifier_DT5 = KNeighborsClassifier(n_neighbors=11,
    weights='uniform', 
    p=2                  # Power parameter (1=Manhattan, 2=Euclidean)
)
classifier_DT5.fit(x_train_na, y_train)
KNeighborsClassifier(n_neighbors=11)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Show the code
y_pred = classifier_DT.predict(x_test)
print("Random Forest:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

y_pred2 = classifier_DT2.predict(x_test_na)
print("Gradient Boosting:")
print("Accuracy:", accuracy_score(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
Random Forest:
Accuracy: 0.6826923076923077
              precision    recall  f1-score   support

           0       0.35      0.14      0.20        59
           1       0.72      0.90      0.80       149

    accuracy                           0.68       208
   macro avg       0.54      0.52      0.50       208
weighted avg       0.62      0.68      0.63       208

Gradient Boosting:
Accuracy: 0.6875
              precision    recall  f1-score   support

           0       0.38      0.15      0.22        59
           1       0.73      0.90      0.80       149

    accuracy                           0.69       208
   macro avg       0.55      0.53      0.51       208
weighted avg       0.63      0.69      0.64       208
Show the code
y_pred3 = classifier_DT3.predict(x_test)
print("Decision Tree:")
print("Accuracy:", accuracy_score(y_test, y_pred3))
print(classification_report(y_test, y_pred3))
Decision Tree:
Accuracy: 0.6923076923076923
              precision    recall  f1-score   support

           0       0.39      0.15      0.22        59
           1       0.73      0.91      0.81       149

    accuracy                           0.69       208
   macro avg       0.56      0.53      0.51       208
weighted avg       0.63      0.69      0.64       208
Show the code
y_pred4 = classifier_DT4.predict(x_test_na)
print("Gaussian NB:")
print("Accuracy:", accuracy_score(y_test, y_pred4))
print(classification_report(y_test, y_pred4))

y_pred5 = classifier_DT5.predict(x_test_na)
print("K-Neighbors:")
print("Accuracy:", accuracy_score(y_test, y_pred5))
print(classification_report(y_test, y_pred5))
Gaussian NB:
Accuracy: 0.28846153846153844
              precision    recall  f1-score   support

           0       0.28      0.98      0.44        59
           1       0.67      0.01      0.03       149

    accuracy                           0.29       208
   macro avg       0.47      0.50      0.23       208
weighted avg       0.56      0.29      0.14       208

K-Neighbors:
Accuracy: 0.6778846153846154
              precision    recall  f1-score   support

           0       0.36      0.17      0.23        59
           1       0.73      0.88      0.80       149

    accuracy                           0.68       208
   macro avg       0.54      0.52      0.51       208
weighted avg       0.62      0.68      0.64       208
Show the code
model_scores = pd.DataFrame({
    'Model': ['Random Forest', 'Gradient Boosting', 'Decision Tree', 'Gaussian NB'],
    'Accuracy': [0.690, 0.667, 0.625, 0.649]
})

ggplot(model_scores, aes(x='Model', y='Accuracy', fill='Model')) + \
    geom_bar(stat='Identity', size=7) + \
    coord_flip() +\
    labs(title='Model Accuracy Comparison', y='Accuracy') + \
    theme_bw() + \
    theme(
        axis_text_x=element_text(hjust=.5, size=12),
        axis_text_y=element_text(size=12),
        plot_title=element_text(size=16, face='bold', hjust=0.5),
        legend_position='none'
    )

Model Performance Analysis

Random Forest achieved the highest overall accuracy at 69.0%, with strong recall (91%) and F1-score (0.81) for predicting individuals earning at least $50k. However, it performed very poorly on the lower-income class with just 22% f1 and 15% recall.

Gradient Boosting followed with an accuracy of 67.8%. It showed identical results to random forest as far as recall, precision, and f1 go. In essence, great at predicting the true cases, and terrible at the untrue cases.

** Decision Tree classifier follows the same pattern overall as Random Forest and Gradient Boosting Classifiers.

Gaussian Naive Bayes reached 28..8% accuracy but completely failed to classify high-income respondents, with 1% precision. It was heavily biased toward predicting class 2 only.

**K-Neighbors, follows a relatively similar pattern, with 67% accuracy and a little more balanced recall and f1 scores.

Conclusion and Recommendation

While Gradient Boosting offered a more balanced performance across both classes, I believe the Random Forest Classifier is the better choice in this context. Since the primary goal is to accurately predict whether someone earns more than $50,000, overall accuracy and precision for the high-income class matter more than recall for the low-income group. Random Forest achieved the highest accuracy and the strongest precision for class 1, making it more effective at identifying higher earners — which aligns directly with the objective of this analysis.

Feature Importance in Gradient Boosting Results

After training the RandomForestClassifier, I examined which features contributed most to the model’s predictions.

Using the feature_importances_ attribute of the trained model, I identified the top predictors for determining whether a respondent earns over $50,000. These features reflect a combination of demographic factors and Star Wars-related preferences that the model found most useful for predicting income level.

Here are the top 10 most important features:

Show the code
# Get top 10 features by importance
importances = classifier_DT.feature_importances_
features = x_train.columns

importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
top_features = importance_df.sort_values(by='Importance', ascending=False).head(10)

ggplot(top_features, aes(x='Feature', y='Importance', fill='Feature')) + \
    geom_bar(stat='identity', color='black', size=0.4) + \
    ggtitle('Feature Importance - Random Forest') + \
    xlab('Feature') + \
    ylab('Importance') + \
    theme_minimal() + \
    theme(
        axis_text_x=element_text(angle=45, hjust=1, size=12),
        axis_text_y=element_text(size=12),
        plot_title=element_text(size=16, face='bold', hjust=0.5),
        legend_position='none'
    )

In the Random Forest model, feature importance was more evenly distributed across a mix of predictors. Among the features, age and educ were most important, which aligns with expected links between demographics and income. Several Star Wars favorability ratings — including fav_jar, fav_darth, fav_anakin, and fav_padme — also contributed meaningfully. This broader distribution of importance suggests the model’s draws on a variety of demographic and preference-based factors.

In conclusion, it seems as if star wars related preferences and education and age, are not very powerful in predicting income being greater than 50k or not.